Before building the machine learning model, we need to understand the relationship between the two variables: the daily average sentiment and the daily Bitcoin exchange rate change. We start by calculating the covariance, which measures the direction of the relationship between two variables.
library(tidyverse)
library(padr)
library(tidytext)
library(tm)
library(writexl)
library(plyr)
library(lubridate)
library(anytime)
library(plotly)
library(dplyr)
library(corrplot)
library(formatR)
library(htmltools)
library(ggpubr)
library(textstem)
library(vader)
library(zoo)
btc_exchange_rate_history <- read.csv("D:/Suli/Szakdolgozat1/development_n_stuff/aggregated_data.csv") %>%
select(-X) %>%
mutate(created_at = as_date(created_at))
btc_usd_tweets_combined <- read.csv("D:/Suli/Szakdolgozat1/clean data/Bitcoin_exchage_rates.csv") %>%
mutate(Date = as_date(Date)) %>%
mutate(year = year(Date), month = month(Date), day = day(Date)) %>%
mutate(Date = make_date(year, month, day)) %>%
mutate(xchange_rate_change = Close - Open) %>%
filter(Date >= "2019-01-01" & Date <= "2022-03-30") %>%
inner_join(btc_exchange_rate_history, by = c(Date = "created_at")) %>%
select(c(-year, -month, -day))
btc_sent_lineplot <- plot_ly(data = btc_usd_tweets_combined, x = ~Date, y = ~xchange_rate_change,
name = "BTC change", type = "scatter", mode = "lines", color = I("red")) %>%
add_trace(data = btc_usd_tweets_combined, x = ~Date, y = ~daily_avg_sent, yaxis = "y2",
name = "Avg. sentiment", mode = "lines", color = I("blue")) %>%
layout(title = "Bitcoin exchange rate change compared to previous day's closing rate",
margin = list(t = 150), legend = list(x = 1.1), paper_bgcolor = "rgb(255, 255, 255)",
plot_bgcolor = "rgb(255, 255, 255)", xaxis = list(title = "Date", range = list("2019-01-01 00:00:00",
"2019-12-31 23:59:59"), rangeslider = list(type = "date", visible = T),
list(dtickrange = list(NULL, 1000), value = "%H:%M:%S.%L ms"), list(dtickrange = list(1000,
60000), value = "%H:%M:%S s"), list(dtickrange = list(60000, 3600000),
value = "%H:%M m"), list(dtickrange = list(3600000, 86400000), value = "%H:%M h"),
list(dtickrange = list(86400000, 604800000), value = "%e. %b d"), list(dtickrange = list(604800000,
"M1"), value = "%e. %b w"), list(dtickrange = list("M1", "M12"),
value = "%b '%y M"), list(dtickrange = list("M12", NULL), value = "%Y Y"),
rangeselector = list(buttons = list(list(count = 1, label = "1M", step = "month",
stepmode = "backward"), list(count = 6, label = "6M", step = "month",
stepmode = "backward"), list(count = 1, label = "1Y", step = "year",
stepmode = "backward"), list(count = 1, label = "YTD", step = "year",
stepmode = "todate"), list(step = "all", label = "ALL"))), list(dtick = "M1",
tickformat = "%b\n%Y", ticklabelmode = "period")), yaxis = list(title = "BTC exchange rate change",
range = c(min(btc_usd_tweets_combined$xchange_rate_change), max(btc_usd_tweets_combined$xchange_rate_change)),
gridcolor = "rgb(255,255,255)", showgrid = TRUE, showline = FALSE, showticklabels = TRUE,
tickcolor = "rgb(140, 140, 140)", ticks = "outside", zeroline = FALSE),
yaxis2 = list(title = "Daily average sentiment", overlaying = "y", side = "right",
range = c(min(btc_usd_tweets_combined$daily_avg_sent), max(btc_usd_tweets_combined$daily_avg_sent))))
btc_sent_lineplot
plot(density(btc_exchange_rate_history$daily_avg_sent), type = "n", main = "Distribution of average daily sentiment")
polygon(density(btc_exchange_rate_history$daily_avg_sent), col = "red", border = "gray")
plot(density(btc_usd_tweets_combined$xchange_rate_change), type = "n", main = "Distribution of Bitcoin exchange rate change")
polygon(density(btc_usd_tweets_combined$xchange_rate_change), col = "red", border = "gray")
Both distributions look approximately normal, so we can use the Pearson correlation coefficient later on.
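Judging normality from a density plot alone is subjective. As a supplementary check that is not part of the original analysis, base R's `shapiro.test` could be applied to each variable. The sketch below runs on synthetic samples; the variable names are illustrative and do not come from the notebook's data.

```r
# Synthetic stand-ins (assumptions for illustration, not the notebook's data)
set.seed(123)
normal_sample <- rnorm(500, mean = 0, sd = 1)  # bell-shaped variable
skewed_sample <- rexp(500, rate = 1)           # clearly non-normal variable

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed
p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

# A small p-value is evidence against normality
print(c(normal = p_normal, skewed = p_skewed))
```

If a variable fails such a check, a rank-based coefficient such as `cor(x, y, method = "spearman")` is a common alternative to Pearson.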
btc_sent_scatterplot <- plot_ly(data = btc_usd_tweets_combined, y = ~daily_avg_sent,
x = ~xchange_rate_change, marker = list(size = 4, color = "rgba(255, 182, 193, .9)",
line = list(color = "rgba(152, 0, 0, .8)", width = 1))) %>%
layout(yaxis = list(title = "Daily average sentiment"), xaxis = list(title = "BTC exchange rate change (USD)")) %>%
htmltools::div(align = "center")
btc_sent_scatterplot
The plot shows no distinguishable relationship between these variables, but I still calculated the covariance and correlation to get numerical results.
btc_cov <- cov(btc_usd_tweets_combined$daily_avg_sent, btc_usd_tweets_combined$xchange_rate_change,
method = "pearson")
btc_cov
## [1] 3.301177
A positive covariance means that the two variables tend to increase or decrease together. Its magnitude depends on the units of the variables, however, so it says nothing about how strong the relationship is. Correlation rescales the covariance to a unit-free value, so next we calculate the strength of the relationship between these two numerically measured, continuous variables.
btc_cor <- cor(btc_usd_tweets_combined$daily_avg_sent, btc_usd_tweets_combined$xchange_rate_change,
method = "pearson")
btc_cor
## [1] 0.1162632
One of the most common ways to quantify a relationship between two variables is to use the Pearson correlation coefficient, which is a measure of the linear association between two variables.
It always takes on a value between -1 and 1 where:
- -1 indicates a perfectly negative linear correlation between two variables
- 0 indicates no linear correlation between two variables
- 1 indicates a perfectly positive linear correlation between two variables
Often denoted as r, this number helps us understand the strength of the linear relationship between two variables. The closer r is to zero, the weaker the linear relationship.
With r ≈ 0.12, the correlation here is weak: there is at most a minimal linear relationship between daily average sentiment and the exchange rate change.
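To make the link between covariance and the Pearson coefficient concrete, here is a minimal sketch on synthetic data (the variables `sentiment` and `price_change` are made up for illustration). It computes r by hand as the covariance divided by the product of the standard deviations, and shows that rescaling a variable changes the covariance but not r.

```r
# Synthetic illustration data (assumption: not the notebook's data)
set.seed(42)
sentiment <- rnorm(300, mean = 0.1, sd = 0.05)
price_change <- 2000 * sentiment + rnorm(300, sd = 800)

# Pearson r = cov(x, y) / (sd(x) * sd(y))
covariance <- cov(sentiment, price_change)
r_manual <- covariance / (sd(sentiment) * sd(price_change))
r_builtin <- cor(sentiment, price_change, method = "pearson")

# Expressing the price change in cents instead of dollars scales the
# covariance by 100 but leaves the correlation unchanged
cov_cents <- cov(sentiment, price_change * 100)
r_cents <- cor(sentiment, price_change * 100)

print(c(manual = r_manual, builtin = r_builtin, cents = r_cents))
```

This is why a covariance such as 3.30 cannot be judged on its own, while a correlation can be compared across variable pairs.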
After reading some scientific papers, I concluded that continuing down this path was unlikely to produce useful results, so I decided to look at other trend measures that seem more promising.
I downloaded the Google search popularity data for Bitcoin through the trendecon package and the Google Trends API. Because of the limitations of standard Google Trends data, I used the trendecon package’s ts_gtrends_mwd function to build a consistent daily time series.
According to the package documentation, the function constructs a robust and consistent daily time series from Google Trends data: daily, weekly and monthly data is downloaded and consistently aggregated using the Chow-Lin methodology.
btc_ggl_trends <- read.csv("D:/Suli/Szakdolgozat1/data_to_be_cleaned/btc_google_trends.csv")
head(btc_ggl_trends)
## X time value
## 1 1 2019-01-01 1.054060
## 2 2 2019-01-02 17.290539
## 3 3 2019-01-03 19.200650
## 4 4 2019-01-04 15.380214
## 5 5 2019-01-05 7.739413
## 6 6 2019-01-06 7.739342
Checking distribution:
plot(density(btc_ggl_trends$value), type = "n", main = "Distribution of Google Trends data")
polygon(density(btc_ggl_trends$value), col = "red", border = "gray")
btc_usd_ggltrnds_combined <- read.csv("D:/Suli/Szakdolgozat1/clean data/Bitcoin_exchage_rates.csv") %>%
inner_join(btc_ggl_trends, by = c(Date = "time")) %>%
mutate(xchange_rate_change = Close - Open) %>%
mutate(value = ifelse(xchange_rate_change < 0, value * -1, value)) %>%
select(-X) %>%
dplyr::rename(Trends_indicator = value)
btc_ggltrnds_scatterplot <- plot_ly(data = btc_usd_ggltrnds_combined, y = ~Trends_indicator,
x = ~xchange_rate_change, marker = list(size = 4, color = "rgba(255, 182, 193, .9)",
line = list(color = "rgba(152, 0, 0, .8)", width = 1))) %>%
layout(yaxis = list(title = "Google Trends indicator"), xaxis = list(title = "BTC exchange rate change (USD)")) %>%
htmltools::div(align = "center")
btc_ggltrnds_scatterplot
Plotting the absolute values shows the linearity between Google Trends indicator and BTC exchange rate change:
ggl_trends_abs_scatterplt <- plot_ly(data = btc_usd_ggltrnds_combined, y = ~abs(Trends_indicator),
x = ~abs(xchange_rate_change), marker = list(size = 4, color = "rgba(255, 182, 193, .9)",
line = list(color = "rgba(152, 0, 0, .8)", width = 1))) %>%
layout(yaxis = list(title = "Google Trends indicator"), xaxis = list(title = "BTC exchange rate change (USD)")) %>%
htmltools::div(align = "center")
ggl_trends_abs_scatterplt
ggscatter(btc_usd_ggltrnds_combined, x = "xchange_rate_change", y = "Trends_indicator",
add = "reg.line", conf.int = TRUE, cor.coef = TRUE, cor.method = "pearson", xlab = "BTC exchange rate change (USD)",
ylab = "Google Trends indicator")
When you perform a statistical test, the p-value helps you judge the significance of your results in relation to the null hypothesis. It is the probability of observing a result at least as extreme as the one in the data if the null hypothesis were true; it is not the probability that the null hypothesis is correct. A p-value at or below 0.05 is conventionally treated as statistically significant, in which case we reject the null hypothesis in favour of the alternative. For the correlation test shown in the plot above, the null hypothesis is that the true correlation is zero and the alternative is that it is not. A significant result supports a linear association between the variables, but it does not by itself show that one variable affects the other.
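The coefficient and p-value printed on the plot can also be obtained directly with base R's `cor.test`. The following is a minimal sketch on synthetic data; `x` and `y` are placeholders, not columns from the notebook.

```r
# Synthetic data with a genuine linear association (assumption for illustration)
set.seed(7)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

# Pearson correlation test: the null hypothesis is that the true correlation is 0
res <- cor.test(x, y, method = "pearson")

print(res$estimate)  # sample correlation coefficient r
print(res$p.value)   # probability of a result at least this extreme under the null
print(res$conf.int)  # 95% confidence interval for the true correlation
```

Reporting the confidence interval alongside the p-value gives a sense of how precisely the correlation is estimated.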
Covariance:
print(cov(btc_usd_ggltrnds_combined$Trends_indicator, btc_usd_ggltrnds_combined$xchange_rate_change,
method = "pearson"))
## [1] 31092.77
Correlation:
print(cor(btc_usd_ggltrnds_combined$Trends_indicator, btc_usd_ggltrnds_combined$xchange_rate_change,
method = "pearson"))
## [1] 0.7332396
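The coefficient of about 0.73 indicates a fairly strong positive linear relationship. Part of it is by construction, because the indicator was given the sign of the price change when the dataframe was built.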
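Because the mutate step above multiplies the Trends value by -1 on days with a negative price change, some positive correlation with the price change is to be expected regardless of the search data itself. The simulation below is a sketch on synthetic, mutually independent data (all names are illustrative) that gauges how much correlation this transformation can produce on its own.

```r
# Synthetic, mutually independent series (assumption: not the notebook's data)
set.seed(2022)
n <- 5000
price_change <- rnorm(n, mean = 0, sd = 1000)  # synthetic daily price changes
search_volume <- runif(n, min = 0, max = 100)  # synthetic search popularity

# Same transformation as in the notebook: negate the value on down days
signed_indicator <- ifelse(price_change < 0, search_volume * -1, search_volume)

cor_raw <- cor(search_volume, price_change)
cor_signed <- cor(signed_indicator, price_change)
print(c(raw = cor_raw, signed = cor_signed))
```

In this synthetic setting the raw series is essentially uncorrelated with the price change while the signed one is clearly positively correlated, so the coefficient reported above should be read with that construction in mind.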
Chosen to improve model accuracy:
We need to combine all these to get a compact, usable dataframe. First things first: Reddit submissions data cleaning.
Lemmatisation (or lemmatization) is the linguistic process of grouping together the different inflected forms of a word so they can be analysed as a single item.
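# Lemmatize every space-separated word in each element of a character vector
lemmatize_words_in_string <- function(col) {
  # seq_along() is safe for empty input, unlike 1:length(col)
  for (i in seq_along(col)) {
    col[i] <- col[i] %>%
      strsplit(" ") %>%
      unlist() %>%
      lemmatize_words() %>%
      paste(collapse = " ")
  }
  return(col)
}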
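To illustrate the split, map and paste pattern used by the function above without depending on textstem, here is a small base R sketch. The lookup table `lemma_lookup` and the helper `lemmatize_with_lookup` are hypothetical; `textstem::lemmatize_words` relies on a far larger dictionary.

```r
# Hypothetical toy dictionary mapping inflected forms to their lemma
lemma_lookup <- c(buying = "buy", bought = "buy", prices = "price", wallets = "wallet")

# Replace every word found in the lookup by its lemma, string by string
lemmatize_with_lookup <- function(strings, lookup) {
  vapply(strings, function(s) {
    words <- unlist(strsplit(s, " ", fixed = TRUE))
    hits <- words %in% names(lookup)
    words[hits] <- lookup[words[hits]]
    paste(words, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

lemmatized <- lemmatize_with_lookup(c("bought two wallets", "prices keep rising"), lemma_lookup)
print(lemmatized)
```

After this step "bought" and "buying" are both counted as "buy", which is what makes the later word-frequency counts meaningful.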
path <- "D:/Suli/Szakdolgozat1/data_to_be_cleaned/reddit_adat/bitcoin_sub"
bitcoin_sub_df <- list.files(path, pattern = "*.rds", full.names = TRUE) %>%
map_dfr(readRDS)
clean_bitcoin_sub_df <- bitcoin_sub_df %>%
filter(author != "rBitcoinMod", author != "crypto_bot", selftext != "[removed]",
selftext != "[deleted]") %>%
mutate(selftext = paste(title, selftext, sep = " ")) %>%
select(-title) %>%
mutate(selftext = tolower(selftext)) %>%
mutate(selftext = lemmatize_words_in_string(selftext)) %>%
mutate(selftext = gsub("[^\001-\177]", "", selftext)) %>%
mutate(selftext = gsub("btc|xbt", "bitcoin", selftext)) %>%
mutate(selftext = gsub("cryptocurrency|cryptocurrencies", "crypto", selftext)) %>%
mutate(selftext = gsub("https?:\\/\\/\\S+", "", selftext)) %>%
mutate(selftext = gsub("\\$|usd|dollars", "dollar", selftext)) %>%
mutate(selftext = gsub("[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}",
" \\1", selftext)) %>%
mutate(created_utc = anydate(created_utc))
Looking at most frequent words used in the submissions:
stopwords <- data.frame(word = stopwords("eng"))
misc_df_word_freq <- clean_bitcoin_sub_df %>%
unnest_tokens(word, selftext) %>%
anti_join(stopwords)
## Joining, by = "word"
frequent_words <- misc_df_word_freq %>%
select(word) %>%
dplyr::count(word) %>%
arrange(desc(n)) %>%
slice(-1) %>%  # drop the single most frequent word
top_n(20)
## Selecting by n
fig <- plot_ly(data = frequent_words, x = ~word, y = ~n, type = "bar") %>%
layout(xaxis = list(categoryorder = "total descending"))
fig
Choose words that could be indicative of price (buy, get, price, good, like, make, money, crypto, wallet) and group their occurrences by date. Missing values are interpolated.
toMatch <- c("buy", "get", "price", "good", "like", "make", "money", "crypto", "wallet")
word_matches <- filter(misc_df_word_freq, word %in% toMatch) %>%
select(created_utc, word) %>%
group_by(created_utc, word) %>%
summarise(count = n()) %>%
ungroup() %>%
spread(word, count) %>%
mutate(across(where(is.numeric), ~ifelse(is.na(.), 0, .))) %>%
pad() %>%
mutate_at(toMatch, na.approx) %>%
mutate_at(toMatch, round)
## `summarise()` has grouped output by 'created_utc'. You can override using the
## `.groups` argument.
## pad applied on the interval: day
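The pad-then-interpolate step can be illustrated with base R only. The sketch below uses made-up counts for a single word and `approx` for linear interpolation, which for evenly spaced daily rows is comparable to what `pad()` followed by `na.approx()` does; the data frame and its values are assumptions for illustration.

```r
# Made-up daily counts with two missing days (2022-03-03 and 2022-03-04)
counts <- data.frame(
  date = as.Date(c("2022-03-01", "2022-03-02", "2022-03-05")),
  buy = c(4, 6, 12)
)

# Step 1: pad, i.e. create one row per calendar day
all_days <- data.frame(date = seq(min(counts$date), max(counts$date), by = "day"))
padded <- merge(all_days, counts, by = "date", all.x = TRUE)

# Step 2: fill the gaps by linear interpolation, then round to whole counts
padded$buy <- round(approx(
  x = as.numeric(counts$date),
  y = counts$buy,
  xout = as.numeric(padded$date)
)$y)

print(padded)
```

Interpolated counts are estimates rather than observations, so long gaps in the source data deserve a closer look before modelling.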
Plot of occurrences over time:
fig <- plot_ly(data = word_matches, x = ~created_utc, y = ~buy, name = "Buy", type = "scatter",
mode = "lines", height = 500) %>%
layout(title = "Number of occurrences of possible price indicative words", paper_bgcolor = "rgb(255, 255, 255)",
plot_bgcolor = "rgb(186, 186, 186)", xaxis = list(title = "Date", range = list("2022-03-16 00:00:00",
"2022-03-30 23:59:59"), rangeslider = list(type = "date", visible = T),
list(dtickrange = list(NULL, 1000), value = "%H:%M:%S.%L ms"), list(dtickrange = list(1000,
60000), value = "%H:%M:%S s"), list(dtickrange = list(60000, 3600000),
value = "%H:%M m"), list(dtickrange = list(3600000, 86400000), value = "%H:%M h"),
list(dtickrange = list(86400000, 604800000), value = "%e. %b d"), list(dtickrange = list(604800000,
"M1"), value = "%e. %b w"), list(dtickrange = list("M1", "M12"),
value = "%b '%y M"), list(dtickrange = list("M12", NULL), value = "%Y Y"),
rangeselector = list(buttons = list(list(count = 1, label = "1M", step = "month",
stepmode = "backward"), list(count = 6, label = "6M", step = "month",
stepmode = "backward"), list(count = 1, label = "1Y", step = "year",
stepmode = "backward"), list(count = 1, label = "YTD", step = "year",
stepmode = "todate"), list(step = "all", label = "ALL"))), list(dtick = "M1", tickformat = "%b\n%Y",
ticklabelmode = "period")), yaxis = list(title = "Number of occurrences",
gridcolor = "rgb(255,255,255)", showgrid = TRUE, showline = FALSE, showticklabels = TRUE,
tickcolor = "rgb(140, 140, 140)", ticks = "outside", zeroline = FALSE)) %>%
add_trace(y = ~crypto, name = "Crypto", mode = "lines") %>%
add_trace(y = ~get, name = "Get", mode = "lines") %>%
add_trace(y = ~good, name = "Good", mode = "lines") %>%
add_trace(y = ~like, name = "Like", mode = "lines") %>%
add_trace(y = ~make, name = "Make", mode = "lines") %>%
add_trace(y = ~money, name = "Money", mode = "lines") %>%
add_trace(y = ~price, name = "Price", mode = "lines") %>%
add_trace(y = ~wallet, name = "Wallet", mode = "lines")
fig
rm(fig)
Correlation between the chosen words and the Bitcoin closing price:
## [1] "crypto-closing price:"
## [1] 0.519687
## [1] "make-closing price:"
## [1] 0.5553536
## [1] "buy-closing price:"
## [1] 0.4881181
## [1] "get-closing price:"
## [1] 0.5237107
## [1] "good-closing price:"
## [1] 0.5428107
## [1] "like-closing price:"
## [1] 0.5617066
## [1] "money-closing price:"
## [1] 0.4393078
## [1] "price-closing price:"
## [1] 0.4881181
## [1] "wallet-closing price:"
## [1] 0.4174382
Only moderate correlations were found, ranging from about 0.42 to 0.56. Lastly, we check for correlation between Reddit sentiment and the Bitcoin closing price.
splitting_func <- function(string) {
strsplit(string, "(?<=[?.!\"\n])\\s?(?=[a-z])", perl = T)
}
sent_bitcoin_sub_df <- bitcoin_sub_df %>%
filter(author != "rBitcoinMod", author != "crypto_bot", selftext != "[removed]",
selftext != "[deleted]") %>%
mutate(selftext = paste(title, selftext, sep = ".")) %>%
select(c(-num_comments, -title, -score, -subreddit, -id, -author)) %>%
mutate(selftext = tolower(selftext)) %>%
mutate(selftext = lemmatize_words_in_string(selftext)) %>%
mutate(selftext = gsub("[^\001-\177]", "", selftext)) %>%
mutate(selftext = gsub("btc|xbt", "bitcoin", selftext)) %>%
mutate(selftext = gsub("cryptocurrency|cryptocurrencies", "crypto", selftext)) %>%
mutate(selftext = gsub("https?:\\/\\/\\S+", "", selftext)) %>%
mutate(selftext = gsub("\\$|usd|dollars", "dollar", selftext)) %>%
mutate(created_utc = anydate(created_utc)) %>%
unnest_tokens(tbl = ., input = selftext, output = sentences, token = splitting_func) %>%
mutate(sentences = gsub("[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}",
" \\1", sentences))
reddit_vader_df <- readRDS("D:/Suli/Szakdolgozat1/data_to_be_cleaned/reddit_adat/reddit_vader_df.rds") %>%
cbind(sent_bitcoin_sub_df) %>%
select(c(compound, pos, neu, neg, -sentences, created_utc)) %>%
mutate(across(c(pos, neg, neu), ~replace_na(., 0))) %>%
mutate(positive = case_when(pos >= 0.2 ~ 1, pos < 0.2 ~ 0), negative = case_when(neg >=
0.2 ~ 1, neg < 0.2 ~ 0)) %>%
select(created_utc, positive, negative) %>%
group_by(created_utc) %>%
summarise(total_positive = sum(positive), total_negative = sum(negative)) %>%
pad() %>%
mutate(across(c(total_positive, total_negative), na.approx)) %>%
mutate(across(c(total_positive, total_negative), round))
## pad applied on the interval: day
btc_usd_ggltrnds_combined <- btc_usd_ggltrnds_combined %>%
filter(row_number() <= n() - 1) %>%
cbind(reddit_vader_df) %>%
select(-created_utc)
plot(density(btc_usd_ggltrnds_combined$total_positive), type = "n", main = "Distribution of total positives")
polygon(density(btc_usd_ggltrnds_combined$total_positive), col = "red", border = "gray")
Correlation between the Bitcoin closing price and total positives/negatives:
print(cor(btc_usd_ggltrnds_combined$Close, btc_usd_ggltrnds_combined$total_positive))
## [1] 0.5899299
print(cor(btc_usd_ggltrnds_combined$Close, btc_usd_ggltrnds_combined$total_negative))
## [1] 0.5286992
misc_df <- sent_bitcoin_sub_df %>%
mutate(n = 1) %>%
select(-sentences) %>%
group_by(created_utc) %>%
summarise(number_of_posts = sum(n)) %>%
pad() %>%
mutate(number_of_posts = round(na.approx(number_of_posts))) %>%
cbind(btc_usd_ggltrnds_combined) %>%
select(-Date)
## pad applied on the interval: day
plot(density(misc_df$number_of_posts), type = "n", main = "Distribution of the number of Reddit posts")
polygon(density(misc_df$number_of_posts), col = "red", border = "gray")
print(cor(misc_df$Close, misc_df$number_of_posts))
## [1] 0.5995405
This shows a moderate correlation between the number of Reddit posts and the Bitcoin closing price. Next, the same check with the number of tweets:
tweets <- read_csv("D:/Suli/Szakdolgozat1/data_to_be_cleaned/twitter_adat/n_of_tweets_#bitcoin.csv")
## Rows: 3104 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (1): V2
## date (1): V1
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
formatted_tweets <- tweets %>%
rename(date = V1) %>%
rename(number_of_tweets = V2)
df_with_nr_of_tweets <- misc_df %>%
inner_join(formatted_tweets, by = c(created_utc = "date")) %>%
select(c(-total_positive, -total_negative)) %>%
mutate(number_of_tweets = round(na.approx(number_of_tweets)))
print(cor(df_with_nr_of_tweets$Close, df_with_nr_of_tweets$number_of_tweets, method = "pearson"))
## [1] 0.8149475
Incorporating all features into the dataset, then making a correlation matrix to check for multicollinearity:
temp <- do.call(cbind, lapply(list.files(path = "D:/Suli/Szakdolgozat1/data_to_be_cleaned/",
pattern = "*.xlsx", full.names = TRUE), readxl::read_xlsx)) %>%
select(c(Nr_of_Bitcoin_transactions, Nr_of_unique_Bitcoin_addresses, Bitcoin_transaction_volume,
Wikipedia_searches_Bitcoin))
final_df <- df_with_nr_of_tweets %>%
cbind(temp) %>%
select(c(-Open, -Low, -High, -Adj.Close, -Volume)) %>%
dplyr::rename(Price_change = xchange_rate_change, Date = created_utc, Nr_of_tweets = number_of_tweets,
Nr_of_Reddit_posts = number_of_posts, Closing_price = Close, Google_Trends_indicator = Trends_indicator)
cor(final_df[, 2:10])
## Nr_of_Reddit_posts Closing_price
## Nr_of_Reddit_posts 1.00000000 0.599540526
## Closing_price 0.59954053 1.000000000
## Google_Trends_indicator -0.02114934 -0.009710201
## Price_change 0.01547807 0.028806174
## Nr_of_tweets 0.53459333 0.814947482
## Nr_of_Bitcoin_transactions -0.18030314 -0.473484692
## Nr_of_unique_Bitcoin_addresses 0.52469752 0.551212924
## Bitcoin_transaction_volume 0.68657114 0.507942237
## Wikipedia_searches_Bitcoin 0.65723075 0.633144432
## Google_Trends_indicator Price_change
## Nr_of_Reddit_posts -0.0211493436 0.015478072
## Closing_price -0.0097102007 0.028806174
## Google_Trends_indicator 1.0000000000 0.733002214
## Price_change 0.7330022142 1.000000000
## Nr_of_tweets -0.0002901486 0.016924622
## Nr_of_Bitcoin_transactions 0.0560514757 0.028800121
## Nr_of_unique_Bitcoin_addresses 0.0297285971 0.025437223
## Bitcoin_transaction_volume 0.0037480295 0.027803306
## Wikipedia_searches_Bitcoin -0.0295092837 -0.007716001
## Nr_of_tweets Nr_of_Bitcoin_transactions
## Nr_of_Reddit_posts 0.5345933334 -0.18030314
## Closing_price 0.8149474816 -0.47348469
## Google_Trends_indicator -0.0002901486 0.05605148
## Price_change 0.0169246219 0.02880012
## Nr_of_tweets 1.0000000000 -0.48788467
## Nr_of_Bitcoin_transactions -0.4878846702 1.00000000
## Nr_of_unique_Bitcoin_addresses 0.4654968124 0.24242612
## Bitcoin_transaction_volume 0.4283644670 -0.07026457
## Wikipedia_searches_Bitcoin 0.5999385283 -0.24902695
## Nr_of_unique_Bitcoin_addresses
## Nr_of_Reddit_posts 0.52469752
## Closing_price 0.55121292
## Google_Trends_indicator 0.02972860
## Price_change 0.02543722
## Nr_of_tweets 0.46549681
## Nr_of_Bitcoin_transactions 0.24242612
## Nr_of_unique_Bitcoin_addresses 1.00000000
## Bitcoin_transaction_volume 0.51692135
## Wikipedia_searches_Bitcoin 0.44445939
## Bitcoin_transaction_volume
## Nr_of_Reddit_posts 0.68657114
## Closing_price 0.50794224
## Google_Trends_indicator 0.00374803
## Price_change 0.02780331
## Nr_of_tweets 0.42836447
## Nr_of_Bitcoin_transactions -0.07026457
## Nr_of_unique_Bitcoin_addresses 0.51692135
## Bitcoin_transaction_volume 1.00000000
## Wikipedia_searches_Bitcoin 0.62343595
## Wikipedia_searches_Bitcoin
## Nr_of_Reddit_posts 0.657230750
## Closing_price 0.633144432
## Google_Trends_indicator -0.029509284
## Price_change -0.007716001
## Nr_of_tweets 0.599938528
## Nr_of_Bitcoin_transactions -0.249026953
## Nr_of_unique_Bitcoin_addresses 0.444459386
## Bitcoin_transaction_volume 0.623435947
## Wikipedia_searches_Bitcoin 1.000000000
The correlation matrix doesn’t show anything outstanding: the largest off-diagonal value is about 0.81, between Closing_price and Nr_of_tweets. Feature importance can be assessed in the ML notebook.
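Reading a wrapped 9 x 9 matrix by eye is error-prone, so the strongly correlated pairs can also be listed programmatically. The sketch below uses synthetic features and an assumed cutoff of 0.8; both the data and the threshold are illustrative choices, not part of the original analysis.

```r
# Synthetic features: b is nearly a copy of a, c is independent (illustration only)
set.seed(1)
n <- 200
a <- rnorm(n)
features <- data.frame(a = a, b = a + rnorm(n, sd = 0.1), c = rnorm(n))

cm <- cor(features)
threshold <- 0.8  # assumed cutoff for "high" correlation

# Look only at the upper triangle so each pair is reported once
idx <- which(abs(cm) >= threshold & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r = cm[idx]
)
print(high_pairs)
```

The same lines could be pointed at `cor(final_df[, 2:10])` to list the pairs that exceed whichever threshold is chosen.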
Exporting final dataframe:
write_xlsx(final_df, "D:/Suli/Szakdolgozat1/clean data/data_for_training.xlsx")
With all of these in order, we can start tweaking the Prophet model.